26/09/22


Causal ML
___________

We can view an SCM from the perspective of a downstream feature Y: every other feature is either a causal parent of Y or causally irrelevant to Y (at least given the parents). We can think of the salient cause of Y as some latent function of the parents, called the content feature C; everything else is a spurious feature S.
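
To make the C/S split concrete for myself, here is a toy sketch (my own construction, not taken from any paper). It assumes a binary content feature C that fully determines Y, and a spurious feature S that merely agrees with Y with probability p_agree. The names sample_scm, accuracy, use_s and use_c are all made up for illustration.

```python
import random


def sample_scm(n, p_agree, seed=0):
    """Draw n rows (c, s, y) from a toy SCM.

    C -> Y is the only causal mechanism. S has no effect on Y; it just
    agrees with Y with probability p_agree (an assumed, tunable knob).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.choice([0, 1])                       # content feature
        y = c                                        # Y is determined by C alone
        s = y if rng.random() < p_agree else 1 - y   # spurious feature
        rows.append((c, s, y))
    return rows


def accuracy(rows, predict):
    """Fraction of rows where predict(c, s) equals y."""
    return sum(predict(c, s) == y for c, s, y in rows) / len(rows)


def use_s(c, s):
    """A predictor that latched onto the spurious feature."""
    return s


def use_c(c, s):
    """A predictor that uses the content feature."""
    return c


train = sample_scm(2000, p_agree=0.95, seed=0)    # S tracks Y during training
shifted = sample_scm(2000, p_agree=0.05, seed=1)  # the association flips later

print("use_s train/shifted:", accuracy(train, use_s), accuracy(shifted, use_s))
print("use_c train/shifted:", accuracy(train, use_c), accuracy(shifted, use_c))
```

The S-based predictor looks good on the training distribution and collapses once the S-Y association changes, while the C-based predictor is unaffected, which is the failure mode the rest of these notes is about.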

Strategies have been identified that align model fitting with the goal of identifying C and separating it from S, both to generalize better and to better reflect the true causal relationships. In other words, the aim is to avoid overfitting to a spurious relationship.

Data augmentation is one such technique. It can be viewed as intervening on the spurious features while holding the latent content features fixed. For example, in some domains, rotating or shifting an image does not affect the salient content that determines the true downstream label of the image.
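
A minimal sketch of the "augmentation as intervention on S" view, under assumptions I am making up for illustration: the image is a small binary grid, the content feature C is the number of active pixels, the spurious feature S is their horizontal position, and the label depends on C only. All function names are hypothetical.

```python
def shift_columns(image, k):
    """Cyclically shift each row of a 2D grid k columns to the right."""
    width = len(image[0])
    k %= width
    return [row[width - k:] + row[:width - k] for row in image]


def content(image):
    """Assumed content feature C: number of active pixels."""
    return sum(sum(row) for row in image)


def position(image):
    """Assumed spurious feature S: mean column index of active pixels."""
    cols = [j for row in image for j, v in enumerate(row) if v]
    return sum(cols) / len(cols)


def label(image):
    """Toy ground truth: Y depends on C only."""
    return int(content(image) >= 3)


def augment(image, y):
    """One shifted copy per column offset, each keeping the label y."""
    return [(shift_columns(image, k), y) for k in range(len(image[0]))]


image = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
augmented = augment(image, label(image))

for img, y in augmented:
    print("S =", position(img), " C =", content(img), " Y =", y)
```

Each augmented copy has a different S but the same C and therefore the same Y, so a model trained on the augmented set gets no reward for relying on position.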

Some work produces counterfactual examples through data augmentation, for example by editing documents in a way that changes their label. This yields positive and negative samples that can be used to learn representations via contrastive learning.
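
A rough sketch of how such samples could feed a contrastive objective. I am assuming a standard triplet margin loss, with a label-preserving edit as the positive and a label-flipping (counterfactual) edit as the negative; the work in question may use a different objective. The two encoders and the tiny SENTIMENT lexicon are hypothetical stand-ins for learned representations.

```python
import math

# Hypothetical two-word sentiment lexicon, for illustration only.
SENTIMENT = {"great": 1.0, "terrible": -1.0}


def bag_of_words(text, vocab):
    """Surface representation: one count per vocabulary word."""
    words = text.split()
    return [float(words.count(w)) for w in vocab]


def sentiment_only(text):
    """Content-like representation: summed sentiment of the words."""
    return [sum(SENTIMENT.get(w, 0.0) for w in text.split())]


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


anchor = "the movie was great"
positive = "the film was great"        # edit that keeps the label
negative = "the movie was terrible"    # minimal edit that flips the label
texts = (anchor, positive, negative)
vocab = sorted(set(" ".join(texts).split()))

surface_loss = triplet_loss(*(bag_of_words(t, vocab) for t in texts))
content_loss = triplet_loss(*(sentiment_only(t) for t in texts))
print("surface:", surface_loss, " content:", content_loss)
```

The surface representation treats both edits as equally far from the anchor and so pays the full margin, while the content-like representation already separates the counterfactual and pays nothing. Minimizing this loss would therefore push an encoder toward the second kind of representation.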

There is a method built on a similar idea, which incorporates samples with contrasting labels but similar content (raw features?). Perhaps the goal is to force granularity into the model, which is desirable for avoiding spurious associations, and to achieve this the model would need salient content representations. One technique for computing good representations is to use regularization to align the gradients of the model (function) with the feature deltas of counterfactual examples. In other words, if we have two counterfactual samples (x_i, y_i) and (x_j, y_j) with different labels y but similar features x, we compute the model pass f(x_i) and its gradient wrt the features x (at least that’s what I understood); let’s call it df(x_i)/dx. We also compute the difference x_j - x_i. We want the gradient to be similar to this delta.

Perhaps the reasoning is as follows. Since the two samples are close in feature space but have different labels, we can view the delta as a minimal translation that moves x_i to a different label. The gradient wrt x is, in turn, the direction in which moving x increases the function value at x_i most quickly. Here their logic gets muddier, but one way to understand the methodology is to view f as a loss function, which we want to be as low as possible. If the gradient wrt x is similar to the delta (the minimal translation/direction that changes the label), then the quantity we add to x to increase the loss is similar to the quantity we add to x to change the label. Say we have a feature set x and add to it some quantity d that is enough to significantly increase the loss f, where f measures the deviation between our estimated function g and the ground truth. If d is very similar to the quantity d’ that is enough to change the label of x under the ground truth function, this equivalence implies that g is similar to the ground truth function.

If d is significantly different from d’, then our function has learned a different categorization from the ground truth, since to increase the loss according to g we would move in a different direction from the one that crosses the true label boundary. So minimizing the distance between d and d’ is expected to help generalization and to bring g closer to the ground truth function and its boundaries. They do this via regularization; in other words, they softly constrain g so that its gradient wrt x (in other words, d) aligns with the computed deltas (approximations of d’). This method definitely has problems, as it hinges on very imperfect human bias, but it could work and improve generalization to some degree.
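
A minimal sketch of what such a regularizer could look like, under several assumptions of mine: the model g is a logistic function of the features, "similar" is measured by cosine similarity, and the gradient is taken of the predicted probability of the counterfactual's label y_j (so it should point from x_i toward x_j). This is one reading of the idea, not necessarily the exact formulation in the work; the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(w, b, x):
    """Gradient of g(x) = sigmoid(w . x + b) with respect to the features x."""
    p = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)
    return [p * (1.0 - p) * wk for wk in w]


def cosine(u, v):
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def alignment_penalty(w, b, x_i, x_j, y_j):
    """1 - cos(gradient of P(y_j | x) at x_i, x_j - x_i); 0 means aligned."""
    grad = input_gradient(w, b, x_i)   # gradient of P(y = 1 | x)
    if y_j == 0:
        grad = [-g for g in grad]      # gradient of P(y = 0 | x) instead
    delta = [xj - xi for xi, xj in zip(x_i, x_j)]
    return 1.0 - cosine(grad, delta)


# Counterfactual pair: only the second feature differs, and the label flips.
x_i, y_i = [1.0, 0.0], 0
x_j, y_j = [1.0, 1.0], 1

causal_penalty = alignment_penalty([0.0, 2.0], 0.0, x_i, x_j, y_j)
spurious_penalty = alignment_penalty([2.0, 0.0], 0.0, x_i, x_j, y_j)
print("causal:", causal_penalty, " spurious:", spurious_penalty)
```

A model whose weight sits on the feature that actually changed gets a penalty near 0, and one whose weight sits on the unchanged feature gets a penalty of 1. In training, this term would be added to the usual task loss with some weight lambda (a hyperparameter I am assuming), which is the "soft constraint" described above.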

Meta thought: this approach of writing and summarizing might be a tad inefficient.